Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

DOI: 10.1201/9781003355205-3

Chapter 3

De Novo Genome Assembly

3.1 INTRODUCTION TO DE NOVO GENOME ASSEMBLY

In the previous chapter, we discussed mapping reads to the reference genomes of organisms that have reference genome sequences available, and we also discussed reference-based genome assembly. But what if no reference genome is available for an organism, or we need to sequence the genome of an unknown organism? In this case, aligning reads to a reference genome is not possible, and we must assume no prior knowledge about the genome of that organism, its length, or its composition. This is where de novo genome assembly comes into play. De novo genome assembly is a strategy for assembling a novel genome from scratch without the aid of a reference genome sequence. It can also be used for variant discovery in a species that already has a reference genome, when we need to avoid any bias introduced by prior knowledge. Because of improvements in the cost and quality of DNA sequencing, de novo genome assembly is now widely used, especially in metagenomics for assembling bacterial and viral genomes from environmental and clinical samples.

De novo genome assembly aims to join reads into a contiguous sequence called a contig. Multiple contigs are joined together to form a scaffold, and multiple scaffolds can in turn be linked to form a chromosome. The genome assembly is made up of the consensus sequences. Single-end, paired-end, and mate-pair reads can all be used in de novo assembly, but paired reads are preferred because they provide high-quality alignments across DNA regions containing repetitive sequences and produce longer contigs by filling gaps in the consensus sequence. Assembling the entire genome is usually challenging because of the presence of numerous long tandem repeats in the genome. These repeats create gaps in the assembly. The gap problem can be mitigated by deep sequencing, that is, sequencing a genome multiple times to provide sufficient coverage and sequencing depth, which increases the chance that reads overlap. The sequencing coverage is defined as the average number of reads that align to, or cover, known reference bases of the genome, and it is estimated as follows [1]:

\[
\text{coverage} = \frac{\text{read length} \times \text{number of reads}}{\text{haploid genome length (bp)}} \tag{3.1}
\]
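As a rough illustration of Equation 3.1, the following Python sketch computes the expected coverage from the read length, the number of reads, and the haploid genome length. The function names `sequencing_coverage` and `reads_needed` and the example numbers are hypothetical and chosen only for illustration. The sketch assumes that all reads have the same length in bp and that, for paired-end data, the number of reads counts both mates.

```python
import math


def sequencing_coverage(read_length_bp, number_of_reads, haploid_genome_length_bp):
    """Estimate average coverage as in Equation 3.1.

    Assumes a uniform read length (bp); the function name is illustrative only.
    """
    if haploid_genome_length_bp <= 0:
        raise ValueError("haploid genome length must be positive")
    return read_length_bp * number_of_reads / haploid_genome_length_bp


def reads_needed(target_coverage, read_length_bp, haploid_genome_length_bp):
    """Rearrange Equation 3.1 to estimate the reads required for a target coverage."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return math.ceil(target_coverage * haploid_genome_length_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 reads of 150 bp from a 5 Mbp bacterial genome.
    cov = sequencing_coverage(150, 1_000_000, 5_000_000)
    print(f"Estimated coverage: {cov:.1f}x")

    # Number of 150 bp reads needed to reach 30x on the same genome.
    print(f"Reads needed for 30x: {reads_needed(30, 150, 5_000_000)}")
```

The second function simply solves the same equation for the number of reads, which can help when planning how deeply to sequence a genome of roughly known size. For a truly unknown genome, the haploid genome length would itself have to be estimated first.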